http://economicsofmalaria.com joebrew@gmail.com


This document contains a basic preliminary analysis of data relating to malaria, indoor residual spraying (IRS), and absenteeism at Maragra Açucar, in Maragra, Mozambique. To explore a specific section, click any of the tabs below.

Summary

Description of data

The “Maragra database” consists of the following 9 tables:

Table name    Number of rows    Number of columns
ab            80988             12
ab_panel      1797032           7
bairros       255               6
census        328283            38
clinic        3403              9
clinic_agg    138               8
irs           3391735           4
mc            11578             15
workers       14195             29

These tables cover three kinds of data: those pertaining to medical outcomes (generally referred to as “clinic” data), those pertaining to worker information (demographic and absenteeism-related), and those pertaining to malaria control activities.

These datasets are all available in the (private) maragra R package (access credentials are given to collaborators upon request to joebrew@gmail.com). All datasets except ab_panel are in their original raw format, apart from 4 kinds of changes:

  1. Removal of unnecessary information (columns deemed irrelevant to the study have been removed).
  2. Modification of column names (all column names now use underscores instead of spaces, and are all lowercase).
  3. Standardization of data formats (all dates are now in YYYY-MM-DD format, all names use similar capitalization, and irrelevant fields such as tablet-created timestamps have been removed).
  4. “Feature generation”: certain columns were created in order to ease grouped analysis and merging of datasets (for example, year_month, day_number, etc.).

The ab_panel dataset is not raw data; rather, it is an amalgamation of the ab and workers datasets, combining absence data from the former with worker eligibility dates from the latter, so as to create a “panel” style dataset (ie, one row for each day on which a worker was supposed to work).

The clinic and clinic_agg datasets are similar, but not identical, in form. The former has individual-level data (useful for pairing with absences and demographic census data), whereas the latter is simply the raw counts of cases by nationality over time. Though clinic is much richer and more detailed than clinic_agg, it only covers the period from 2014 through 2016, whereas the latter covers a larger time period (2010-2016).

For a “deep dive” into each kind of data, click the tabs at the top of this page.

In order to reproduce this analysis (and others related to these datasets), follow the instructions on this project’s research compendium page.

IRS data

Data structure

Data about “fumigações” (indoor residual spraying or “IRS”) dates from 2011-10-30 through 2017-07-23. The dataset consists of the following fields:

date
insecticida
casas_cobertas
pulverizados
meta_instance_name
unidade
casas_total
month
year
dow
day
day_number
year_month
longitude_aura
latitude_aura

Each row is one fumigation activity. The location/residence key is the unidade column. The casas_cobertas column indicates the number of houses in that unidade that were sprayed, whereas the pulverizados column indicates the number of rooms.

Insecticide type

The insecticide used is either DDT or ACT. The below is a breakdown of their respective use.

The respective use over time is as follows:

Below is a table of the same data.

IRS seasonality

We can examine the date of fumigations to see if they are seasonal vs. randomly/uniformly distributed throughout the year.

Absenteeism data

Data structure

The analysis of absenteeism data will rely on a panel-style dataset in which one row exists for each worker-day (ie, each day on which the worker is estimated to have been scheduled to work), with columns indicating the outcome (absent or not, sick or not, etc.). The Maragra CRM does not natively store panel-style data, so we construct it from a combination of the workers dataset (from the Human Resources department) and the absenteeism dataset. Certain features pertaining to illness are merged from the clinic dataset.

The ab_panel dataset has the following column names:

oracle_number
date
leave_type
leave_taken
absent
absent_sick
unidade
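
The panel construction described above can be illustrated with a minimal sketch. It is not the code used to build ab_panel: the tuple layout for workers (an identifier plus eligibility start and end dates) is hypothetical, and the sketch assumes every calendar day in the eligibility window counts as a working day.

```python
from datetime import date, timedelta

def build_panel(workers, absences):
    """Return one row per worker per eligible day, flagged absent or not.

    workers:  list of (oracle_number, start_date, end_date) tuples
              (hypothetical layout for the eligibility dates)
    absences: set of (oracle_number, date) pairs with a recorded absence
    """
    panel = []
    for oracle_number, start, end in workers:
        day = start
        while day <= end:
            panel.append({
                "oracle_number": oracle_number,
                "date": day,
                "absent": (oracle_number, day) in absences,
            })
            day += timedelta(days=1)
    return panel

# Made-up example: two workers, one recorded absence.
workers = [(1, date(2013, 1, 1), date(2013, 1, 3)),
           (2, date(2013, 1, 2), date(2013, 1, 2))]
absences = {(1, date(2013, 1, 2))}
panel = build_panel(workers, absences)
```

Each element of panel corresponds to one worker-day, so its length is the denominator used for the absenteeism rates later in this section.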

Data magnitude

Overall, we have observed 1797032 eligible worker days (the equivalent of 4920 years of human activity!), from the period of 2013-01-01 through 2016-12-31.

Absenteeism

Our dataset includes 101112 absences and 1695920 presences, which can be broken down below in percentage terms.

Among absences, the breakdown of absence “type” is as follows.

Absenteeism rate

We calculate an absenteeism rate for any given time period (day, month, etc.) as the number of days not worked divided by the number of days which should have been worked, and multiplied by 100. The below shows the absenteeism rate, by day, for the entire observation period. The size of each dot is a reflection of the number of workers observed at that time.
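
As a small worked sketch of this definition, the function below applies it to the overall totals reported earlier in this section (101112 absences and 1695920 presences). The function name is illustrative only.

```python
def absenteeism_rate(days_absent, days_eligible):
    """Days not worked divided by days that should have been worked, times 100."""
    if days_eligible <= 0:
        raise ValueError("days_eligible must be positive")
    return 100.0 * days_absent / days_eligible

# Totals for the whole observation period, as reported in this document.
overall = absenteeism_rate(101112, 101112 + 1695920)
```

Applied to the full observation period, this gives an overall rate of roughly 5.6 percent.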

Sick absenteeism rate

We again calculate the absenteeism rate, but only for those absences which are classified as “sick leave”.

Monthly absenteeism

Daily absenteeism is problematic in that it introduces a great deal of noise. So, we examine monthly absenteeism rates for both all and sick absences. The below chart shows these metrics, as well as the percentage of monthly absences due to sickness, and the number of worker-days observed. A local regression smoothed line is overlaid to see overall trends.
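
The monthly aggregation can be sketched as follows. This is an illustration under assumptions rather than the analysis code: rows are taken to be dictionaries keyed by the ab_panel column names date, absent and absent_sick, with dates as YYYY-MM-DD strings, and the example rows are made up.

```python
from collections import defaultdict

def monthly_rates(rows):
    """Aggregate worker-day rows into monthly absenteeism metrics."""
    totals = defaultdict(lambda: {"days": 0, "absent": 0, "sick": 0})
    for row in rows:
        month = row["date"][:7]  # 'YYYY-MM' prefix of a YYYY-MM-DD date
        totals[month]["days"] += 1
        totals[month]["absent"] += row["absent"]
        totals[month]["sick"] += row["absent_sick"]
    result = {}
    for month, t in sorted(totals.items()):
        result[month] = {
            "worker_days": t["days"],
            "absenteeism_rate": 100.0 * t["absent"] / t["days"],
            "sick_absenteeism_rate": 100.0 * t["sick"] / t["days"],
            # Share of that month's absences attributed to sickness;
            # undefined when the month has no absences at all.
            "pct_absences_sick": (100.0 * t["sick"] / t["absent"]
                                  if t["absent"] else None),
        }
    return result

# Made-up rows for illustration only.
rows = [
    {"date": "2013-01-02", "absent": True, "absent_sick": True},
    {"date": "2013-01-02", "absent": True, "absent_sick": False},
    {"date": "2013-01-03", "absent": False, "absent_sick": False},
    {"date": "2013-01-04", "absent": False, "absent_sick": False},
    {"date": "2013-02-01", "absent": False, "absent_sick": False},
]
result = monthly_rates(rows)
```

Keeping the worker-day count alongside each rate makes it possible to see how much data sits behind each monthly point.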

Absenteeism by worker type

The below chart is a paneling of (a) type of absenteeism metric (columns) and (b) type of worker (rows).

Clinic data

Data structure

There are two clinical datasets: clinic and clinic_agg. The former is detailed at the individual level, but the latter covers a slightly wider timespan.

The clinic dataset has the following fields:

date
name
severity
month
year
dow
day
day_number
year_month

The clinic_agg dataset has the following fields:

month
year
group
tested
positive
negative
date
percent_positive

Cases over time - from clinic_agg

All workers

If we combine all workers, we can examine the total incidence of malaria since 2011.

By nationality

The below charts show the total number of tests and positive cases, by month, for Mozambican and foreign workers, respectively, at Ilovo-Maragra.

Cases and tests from clinic_agg

Absolute numbers

The below chart shows both the number of positive cases (all workers, in red) and the number of tests, by month.

Relative numbers

The below chart shows the same data as above, but converts the number of positive cases to a percentage of all tests, rather than an absolute number.
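
A minimal sketch of this conversion, combining the nationality groups into a single monthly figure, is shown below. It assumes rows keyed by the clinic_agg column names year, month, group, tested and positive; the group labels and the counts are made up for illustration.

```python
from collections import defaultdict

def combined_percent_positive(rows):
    """Sum tests and positives across groups, then compute percent positive."""
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [tested, positive]
    for row in rows:
        key = (row["year"], row["month"])
        totals[key][0] += row["tested"]
        totals[key][1] += row["positive"]
    # The percentage is undefined for a month with no tests.
    return {key: (100.0 * positive / tested if tested else None)
            for key, (tested, positive) in sorted(totals.items())}

# Made-up counts; the group labels are hypothetical.
rows = [
    {"year": 2014, "month": 1, "group": "Mozambican", "tested": 30, "positive": 9},
    {"year": 2014, "month": 1, "group": "Foreign", "tested": 10, "positive": 1},
    {"year": 2014, "month": 2, "group": "Mozambican", "tested": 0, "positive": 0},
]
pct = combined_percent_positive(rows)
```

Summing the counts before dividing weights each group by its number of tests, which differs from averaging the two groups' percentages.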

Seasonality data from clinic_agg

We can examine the annual seasonality of positive cases by overlaying all years’ data onto one axis.

The below chart uses the same data as above, but instead of positive cases, it shows the percent of tests which were positive.

If we aggregate and view distributions (via “violin” charts) at the level of the month, seasonality is more apparent.

Malaria severity distribution from clinic

The below chart shows the severity of all clinical malaria cases in the Maragra clinic.

The below chart shows severity, but over time.

Analysis

Effectiveness analysis

Identification strategy

For our purposes we are analyzing the effect of one intervention (IRS) on 2 outcomes (absence and illness) with many confounders (age, worker type, seasonality, etc.). Our analysis can be visualized formulaically as follows:

\[ \begin{equation} \operatorname{Pr}(\text{Outcome} = 1 \mid \text{X}) = \beta_{0} + \beta_{1} \text{Location} + \beta_{2} \text{Season} + (\beta_{3} \text{IRS} \times \beta_{4} \text{IRS}_{t}) + \ldots \end{equation} \]

Our outcome is probabilistic and binomial (ie, a worker is either absent or present / infected or not infected). Our demographic confounders (represented by \(\ldots\)) will be chosen through iterative model selection. Our intervention (IRS) is not a simple yes/no, but rather the product of whether the residence of the worker in question was treated in the last year and, if so, the time since treatment (represented above as the interaction term, where \(_t\) represents time elapsed since commencement of the most recent IRS campaign).

Propensity score matching

Since the full accounting of confounders would greatly reduce the degrees of freedom of our analysis, we employ propensity score matching to generate a matched sample of workers who are alike in characteristics and time, but not treatment. We do this by first estimating the likelihood of having ever received the intervention, given a worker’s age, sex, department and temporary vs. permanent status. We justify the necessity of this matching by noting that the differences between those workers who received IRS and those who did not (see table 1) are striking and in most cases statistically significant.

Table 1: Comparison of unmatched samples
                          IRS              No IRS            p
n                         3395             10796
STATUS = Temporary (%)    3134 (92.3)      10142 (93.9)      0.001
DEPARTMENT (%)                                               0.001
  Administrative          112 (3.3)        294 (2.7)
  Factory                 336 (9.9)        886 (8.2)
  Field                   2947 (86.8)      9616 (89.1)
AGE (mean (sd))           35.34 (10.10)    36.12 (10.23)     <0.001
SEX = M (%)               1947 (57.3)      6478 (60.0)       0.006
RECEIVED = No IRS (%)     0 (0.0)          10796 (100.0)     <0.001

Having now demonstrated that our treatment and control groups are qualitatively different (and therefore require either statistical adjustment or a priori matching), we proceed to carry out the matching, using the best practices suggested by Ho et al. (2007) “for improving parametric statistical models by preprocessing data with nonparametric matching methods” (Daniel Ho, Kosuke Imai, Gary King, and Elizabeth Stuart (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis 15(3): 199-236. http://gking.harvard.edu/files/abs/matchp-abs.shtml). We employ the nearest neighbor method to identify those workers in our control group who most resemble the workers in the treatment group.

Our match is a 1-to-1 cut, meaning those control workers who do not resemble those in the treatment group are left out of primary analysis. The below table shows the match results.

Table 2: Sample sizes
             Control    Treated
All          10796      3395
Matched      3395       3395
Unmatched    7401       0
Discarded    0          0
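
To make the idea of a 1-to-1 nearest neighbor match concrete, the sketch below pairs each treated unit with the closest remaining control unit by propensity score. It is a simplified greedy illustration, not the matching routine used to produce the results above; the identifiers and scores are made up, and processing treated units from the highest score downward is an assumption.

```python
def nearest_neighbor_match(treated, control):
    """Greedy 1-to-1 nearest neighbor matching without replacement.

    treated, control: dicts mapping a unit id to its propensity score.
    Returns a dict mapping each treated id to its matched control id.
    """
    available = dict(control)  # controls not yet used in a pair
    matches = {}
    # Assumed ordering: treated units with the highest scores go first.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        matches[t_id] = c_id
        del available[c_id]  # without replacement: each control used once
    return matches

# Made-up propensity scores for illustration only.
treated = {"t1": 0.30, "t2": 0.22}
control = {"c1": 0.10, "c2": 0.21, "c3": 0.29, "c4": 0.40}
matches = nearest_neighbor_match(treated, control)
unmatched = set(control) - set(matches.values())
```

Controls that are never selected end up in the unmatched set, which mirrors the "Unmatched" row of Table 2.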

The following output shows that the distributions of our variables are now extremely similar.

Table 3: Summary of balance for matched data
                                   Means Treated    Means Control    SD Control    Mean Diff
distance                           0.24             0.24             0.03          0.00
age                                35.34            35.23            10.07         0.11
sexF                               0.43             0.40             0.49          0.02
sexM                               0.57             0.60             0.49          -0.02
permanent_or_temporaryTemporary    0.92             0.92             0.28          0.01
departmentFactory                  0.10             0.10             0.30          0.00
departmentField                    0.87             0.86             0.35          0.01

The propensity scores can be visualized below.

Modeling

Having now created a matched sample of 6790 workers, of which 50% received IRS and 50% did not, we can carry out our analysis on this sample. Since the propensity score matching balances the measured demographic differences, our model need only take into account those differences which are not at the person level. For now, these are limited to seasonality (defined here by quarter); other factors will be added later.

For the purposes of this first pass, we “bin” IRS exposure into 5 groups: never, ever but 180+ days ago, 90-180 days, 60-90 days, and in the last 60 days.
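
The binning can be sketched as a simple lookup. The labels follow those appearing in the odds ratio tables in this section, except for "000-060", which is a hypothetical label for the most recent group. How the boundary days (60, 90, 180) are assigned is not stated in this document, so the cut points below are assumptions.

```python
def irs_bin(days_since):
    """Map days since the most recent IRS to one of five exposure groups.

    days_since is None when the worker's residence was never sprayed.
    """
    if days_since is None:
        return "Never"
    if days_since < 0:
        raise ValueError("days_since cannot be negative")
    if days_since < 60:
        return "000-060"  # hypothetical label for the last-60-days group
    if days_since < 90:
        return "060-090"
    if days_since < 180:
        return "090-180"
    return "180+"

# Made-up inputs for illustration only.
examples = [irs_bin(d) for d in (None, 10, 75, 120, 400)]
```

Treating exposure as a categorical variable in this way lets the model estimate a separate effect for each window without assuming a particular shape for how protection wanes.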

Having estimated our binomial logistic regression model, we examine the odds ratios for absence as a function of our predictive variables.

Variable             OR           Lower        Upper
(Intercept)          0.1040704    0.1002489    0.1080021
days_since060-090    1.0843424    1.0196666    1.1527044
days_since090-180    0.9835209    0.9393110    1.0299224
days_since180+       0.8265855    0.7899572    0.8650181
days_sinceNever      1.0183795    0.9820170    1.0564411
quarter2             0.4774871    0.4659236    0.4893122
quarter3             0.4673738    0.4566575    0.4783256
quarter4             0.6971541    0.6829098    0.7116959
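
For readers less familiar with logistic regression output, odds ratios and their bounds are conventionally obtained by exponentiating a coefficient and the ends of its confidence interval. The sketch below uses a Wald-type 95% interval; this document does not state how the Lower and Upper columns were computed, so that choice is an assumption, and the coefficient and standard error in the example are made up.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logit coefficient and its standard error to (OR, lower, upper).

    Uses a Wald interval: exp(beta - z * se) to exp(beta + z * se).
    """
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

# Hypothetical coefficient and standard error, for illustration only.
or_, lower, upper = odds_ratio_ci(-0.19, 0.023)
```

An odds ratio whose interval excludes 1 indicates an association with the outcome, relative to the reference level, that is distinguishable from no effect at the chosen confidence level.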

We run the same model, but instead of estimating absences, we estimate only the likelihood of sick absences. The results (in the form of odds ratios) are below.

Variable             OR           Lower        Upper
(Intercept)          0.0097908    0.0088217    0.0108362
days_since060-090    1.1837723    1.0059742    1.3896269
days_since090-180    1.2094085    1.0716611    1.3666015
days_since180+       1.0252734    0.9104943    1.1561815
days_sinceNever      1.0844562    0.9829672    1.1998437
quarter2             0.7972699    0.7493009    0.8482335
quarter3             0.8279210    0.7809315    0.8777815
quarter4             0.6656227    0.6265310    0.7071308

To do

  • More robust models and checks.
  • Limit to only clinical malaria.
  • Un-bin IRS (ie, estimate a semi-parametric curve of effectiveness waning).
  • More investigation of worker type, shift, etc.